NSF PAR Search | NSF Public Access Repository

Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Allen, C; Kirtland, A; Tao, R; Lobel, S; Scott, D; Petrocelli, N; Gottesman, O; Parr, R; Littman, M; Konidaris, G (December 2024, Advances in Neural Information Processing Systems)

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to—or knowledge of—an underlying, unobservable state space. Our metric, the λ-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD(λ) with a different value of λ. Since TD(λ=0) makes an implicit Markov assumption and TD(λ=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the λ-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the λ-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different λ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
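A rough sketch of the idea in the abstract above, under stated assumptions and not the authors' implementation: tabular value estimates over observations are computed by batch TD(λ) for λ=0 and λ=1, and the largest absolute gap between them is reported. The function names, the toy three-step episodes, and the use of a max-absolute-difference norm are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def lambda_returns(rewards, next_values, gamma, lam):
    """Backward recursion G_t = r_t + gamma * ((1 - lam) * V(o_{t+1}) + lam * G_{t+1})."""
    g = 0.0  # return after the terminal step
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        out[t] = g
    return out


def batch_td_lambda(episodes, gamma, lam, sweeps=200):
    """Iterate V(o) <- mean lambda-return over all visits to observation o."""
    values = defaultdict(float)
    for _ in range(sweeps):
        targets = defaultdict(list)
        for observations, rewards in episodes:
            # Value of the next observation; 0.0 after the final step.
            next_values = [values[o] for o in observations[1:]] + [0.0]
            returns = lambda_returns(rewards, next_values, gamma, lam)
            for o, g in zip(observations, returns):
                targets[o].append(g)
        values = defaultdict(
            float, {o: sum(gs) / len(gs) for o, gs in targets.items()}
        )
    return dict(values)


def lambda_discrepancy(episodes, gamma, lam_a=0.0, lam_b=1.0):
    """Largest absolute gap between the two value estimates (assumed norm)."""
    v_a = batch_td_lambda(episodes, gamma, lam_a)
    v_b = batch_td_lambda(episodes, gamma, lam_b)
    return max(abs(v_a[o] - v_b[o]) for o in v_a)


# Hypothetical toy data: each episode is (observations, rewards).
# In `markov_episodes` every hidden state has its own observation;
# in `aliased_episodes` the first and third states both emit "A".
markov_episodes = [(["A", "B", "C"], [0.0, 0.0, 1.0])]
aliased_episodes = [(["A", "B", "A"], [0.0, 0.0, 1.0])]

if __name__ == "__main__":
    print("Markov observations: ", lambda_discrepancy(markov_episodes, 0.9))
    print("Aliased observations:", lambda_discrepancy(aliased_episodes, 0.9))
```

In this toy setting the gap is essentially zero when observations are Markov and clearly non-zero when two hidden states share an observation, which is the qualitative behavior the abstract describes.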
Full Text Available
Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs

Fu, H; Yao, J; Gottesman, O; Doshi-Velez, F; Konidaris, GD (May 2023, Proceedings of the Eleventh International Conference on Learning Representations)

In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters specifying the dynamics and reward function for each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories: model transfer and policy transfer, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden parameter estimation error. We first show that the value function of HiP-MDPs is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by varying the hyper-parameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our theoretical results.
Full Text Available
Coarse-Grained Smoothness for Reinforcement Learning in Metric Spaces

Gottesman, O; Asadi, K; Allen, C; Lobel, S; Konidaris, GD; Littman, ML (April 2023, Proceedings of the 26th International Conference on Artificial Intelligence and Statistics)

Principled decision-making in continuous state-action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning. We provide a theoretical analysis of our new smoothness definition, and discuss its implications and impact on control and exploration in continuous domains.
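For reference, the standard Lipschitz condition on a Q-function mentioned in the abstract above can be written as follows. The metric d and the constant L are notation chosen here for illustration; the paper's coarse-grained definition is not reproduced.

```latex
% Q is L-Lipschitz with respect to a metric d on state-action pairs if
\[
  \lvert Q(s, a) - Q(s', a') \rvert \;\le\; L \, d\bigl((s, a), (s', a')\bigr)
  \qquad \text{for all } (s, a),\ (s', a').
\]
```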
Full Text Available
Optimistic Initialization for Exploration in Continuous Control

Lobel, S; Bagaria, A; Allen, C; Gottesman, O; Konidaris, G.D. (February 2022, Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence)

Optimistic initialization underpins many theoretically sound exploration schemes in tabular domains; however, in the deep function approximation setting, optimism can quickly disappear if initialized naively. We propose a framework for more effectively incorporating optimistic initialization into reinforcement learning for continuous control. Our approach uses metric information about the state-action space to estimate which transitions are still unexplored, and explicitly maintains the initial Q-value optimism for the corresponding state-action pairs. We also develop methods for efficiently approximating these training objectives, and for incorporating domain knowledge into the optimistic envelope to improve sample efficiency. We empirically evaluate these approaches on a variety of hard exploration problems in continuous control, where our method outperforms existing exploration techniques.
Full Text Available
Learning Markov State Abstractions for Deep Reinforcement Learning

Allen, C; Parikh, N; Gottesman, O; Konidaris, G.D. (December 2021, Neural Information Processing Systems 34)

A fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features—often matching or exceeding the performance achieved with hand-designed compact state information.
Full Text Available
Combining Parametric and Nonparametric Models for Off-Policy Evaluation

Gottesman, O.; Liu, Y.; Sussex, S.; Brunskill, E.; Doshi-Velez, F (January 2020, International Conference on Machine Learning)

Full Text Available
Representation Balancing MDPs for Off-Policy Policy Evaluation

Liu, Y; Gottesman, O; Raghu, A; Komorowski, M; Faisal, A; Doshi-Velez, F; Brunskill, E (January 2018, Advances in Neural Information Processing Systems)

Full Text Available
